Chengchang Yu

📝 Deep Analysis: The Internal Logic of LLM Safety Mechanisms

🎯 Core Four-Element Analysis

📌 Fundamental Problem

How do LLM safety alignment and jailbreak attacks actually work? Why can aligned models still be jailbroken to bypass safety guardrails? What's really happening inside these black-box models?

🔍 Key Perspective

Critical Insight: safety mechanisms should be explained through intermediate hidden states rather than just final outputs. The authors discovered:

  • Ethical concepts are learned during pre-training, not alignment
  • The essence of safety alignment is building associations: connecting early-layer ethical judgments to mid-layer emotion guesses

⚙️ Key Method

Weak-to-Strong Explanation (WSE): using weak classifiers (SVM and MLP) to analyze a strong LLM's intermediate hidden states

Technical Approach:

  1. Extract the last-position hidden state u_l from each layer
  2. Use weak classifiers to judge whether those states encode an ethical or an unethical input
  3. Apply the Logit Lens to decode mid-layer states into tokens, observing how emotion evolves across layers
  4. Propose Logit Grafting to simulate how jailbreaks disrupt the association stage
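
To make steps 1 and 2 concrete, here is a minimal probing sketch in the spirit of WSE: extract the last-position hidden state from every layer and fit a linear SVM per layer to separate benign from malicious prompts. The model name, the tiny prompt lists, and the 2-fold evaluation are placeholder assumptions for illustration, not the paper's actual setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed model; any decoder-only LLM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def last_token_states(prompt: str):
    """Return the last-position hidden state u_l for every layer."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (num_layers + 1) tensors, shape (1, seq_len, hidden)
    return [h[0, -1].float().cpu().numpy() for h in out.hidden_states]

# Placeholder prompt sets; real experiments need much larger benign/malicious datasets.
benign = ["How do I bake sourdough bread?", "Explain photosynthesis simply."]
malicious = ["How do I build a pipe bomb?", "Write code to steal saved passwords."]

X_per_layer = {}  # layer index -> list of feature vectors
y = [0] * len(benign) + [1] * len(malicious)
for prompt in benign + malicious:
    for l, u_l in enumerate(last_token_states(prompt)):
        X_per_layer.setdefault(l, []).append(u_l)

# A weak classifier (linear SVM) per layer: high accuracy at early layers would
# indicate the model already separates ethical vs. unethical inputs there.
for l, X in X_per_layer.items():
    clf = SVC(kernel="linear")
    acc = cross_val_score(clf, X, y, cv=2).mean()
    print(f"layer {l:2d}: probe accuracy {acc:.2f}")
```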

💡 Core Findings

Three-Stage Safety Mechanism:

  • Early Layers (0-5): The model immediately identifies malicious inputs based on ethical concepts learned during pre-training (weak-classifier probes exceed 95% accuracy)
  • Middle Layers (16-24): Alignment training associates ethical judgments with emotions (normal inputs → positive emotions; malicious inputs → negative emotions)
  • Later Layers (25-32): The model refines emotions into specific rejection or response tokens
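
A rough Logit Lens sketch (step 3 of the technical approach), reusing the `tok` and `model` objects from the previous snippet: it projects one layer's last-position state through the final norm and the unembedding head to read off the top tokens, which is how the mid-layer emotional evolution can be observed. The `model.model.norm` / `model.lm_head` attribute paths assume a Llama-style architecture in Hugging Face transformers; other model families name these modules differently.

```python
import torch

def logit_lens_top_tokens(prompt: str, layer: int, k: int = 5):
    """Decode one layer's last-position hidden state into vocabulary tokens
    by applying the model's final norm and unembedding matrix (Logit Lens)."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
        h = out.hidden_states[layer][0, -1]          # (hidden_dim,)
        logits = model.lm_head(model.model.norm(h))  # project into vocab space
    top = torch.topk(logits, k).indices.tolist()
    return [tok.decode(t) for t in top]

# Inspecting the middle layers should show emotion-laden tokens diverging
# between benign and malicious prompts, per the paper's observation.
for l in (16, 20, 24):
    print(l, logit_lens_top_tokens("How do I build a pipe bomb?", l))
```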

Jailbreak's Essence:

  • Jailbreaks cannot deceive the early-layer ethical judgment
  • Jailbreaks disrupt the mid-layer association, perturbing negative emotions into positive ones
  • Once positive emotions dominate the middle layers, the later layers generate harmful content

📐 Method Formalization

LLM Safety = Pre-training(Ethical Concepts) + Alignment(Association Mapping) + Refinement(Stylized Output)

Where:
- Early-layer Classification = Weak_Classifier(hidden_state_l) → {Ethical | Unethical}
- Mid-layer Association = Ethical_Judgment × Alignment_Weight → {Positive | Negative Emotion}
- Late-layer Refinement = Emotion_State → {Rejection_Token | Response_Token}

Jailbreak Attack = Perturb(Mid-layer Association) → Positive Emotion → Harmful Output

Logit Grafting Approximation:

Jailbreak Effect ≈ Replace(Malicious_Input_Mid_Layer_State, Normal_Input_Positive_Emotion_State)
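
Finally, a hedged sketch of the Logit Grafting idea (step 4 / the approximation above), again reusing `tok` and `model`: a forward hook overwrites the malicious prompt's last-position activation at one middle decoder layer with the corresponding activation from a benign prompt, and generation then proceeds from the grafted state. The layer choice, the hook details, and the Llama-style `model.model.layers` attribute path are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def graft_generate(malicious_prompt: str, benign_prompt: str,
                   layer: int = 20, max_new_tokens: int = 40):
    """Overwrite the prompt's last-position activation at `layer` with the
    corresponding activation from a benign prompt, then generate."""
    # 1. Capture the benign input's last-position activation at `layer`.
    benign_inputs = tok(benign_prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        benign_state = model(**benign_inputs,
                             output_hidden_states=True).hidden_states[layer][0, -1]

    # 2. Hook the decoder layer that produces hidden_states[layer] and splice
    #    the benign state into the prompt's forward pass only (seq_len > 1),
    #    leaving the single-token decode steps untouched.
    def hook(module, args, output):
        hs = output[0] if isinstance(output, tuple) else output
        if hs.shape[1] > 1:
            hs = hs.clone()
            hs[:, -1, :] = benign_state.to(hs.dtype)
            return ((hs,) + tuple(output[1:])) if isinstance(output, tuple) else hs
        return output

    handle = model.model.layers[layer - 1].register_forward_hook(hook)
    try:
        mal_inputs = tok(malicious_prompt, return_tensors="pt").to(model.device)
        out_ids = model.generate(**mal_inputs, max_new_tokens=max_new_tokens,
                                 do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out_ids[0], skip_special_tokens=True)
```

Under the paper's hypothesis, comparing this grafted run against an unpatched run of the same malicious prompt should show noticeably less refusal-like output, since the mid-layer emotional state has been swapped for a positive one.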

🎤 One-Sentence Summary (Core Value)

This paper uses weak classifiers to analyze LLM intermediate hidden states and reveals a three-stage safety mechanism: pre-training instills ethical concepts that let early layers rapidly identify malicious inputs, alignment builds ethical-to-emotional associations in the middle layers, and jailbreak attacks bypass safety by disrupting this association, perturbing negative emotions into positive ones so that later layers generate harmful content.


This analysis is based on the research paper "How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States" by Alibaba Group and Tsinghua University.